Statement of Research Interests

Author

  • Ariel Fuxman
Abstract

When we integrate data from multiple sources, the resulting database may contain errors and inconsistencies. Ideally, all errors should be cleaned automatically. In reality, though, data cleaning necessarily requires human intervention. To facilitate this, I envision a data integration process where it is possible to obtain meaningful and consistent answers from an integrated database even if it is partially dirty. The challenge of query answering over dirty databases is that we cannot naively reuse existing database technology, which is designed to work on clean databases. For this reason, I designed, implemented and evaluated ConQuer [SIGMOD05,VLDB05], a scalable system for query answering over dirty databases. This system helps users take advantage of the query results in order to interactively clean the integrated database.

While it is well known how to answer queries over clean databases, it is necessary to give meaning to the notion of clean answers over dirty databases. In my thesis, I explore two different semantics for clean answers. One [ICDT05] is based upon certain answers, a concept which is prevalent in the data integration theory literature; the other [ICDE06], which is drawn from the area of probabilistic databases, assigns each tuple a probability of being clean.

The key contribution of my work is to bridge the gap between theory and practice by providing an efficient and scalable system to obtain clean query answers from dirty databases. In order to compute “clean” answers efficiently, ConQuer builds upon existing database management technology. In particular, it implements a query rewriting approach. That is, given a query q, ConQuer rewrites q into another query Q∗ such that Q∗ always retrieves the “clean” answers for q on every dirty database. The rewritten query is a SQL query that can be executed using any commercial Database Management System (DBMS). ConQuer also provides optimization techniques that would never be found by standard query optimizers which are in general unaware of data conflicts. In an extensive set of experiments, I showed that the rewritten queries have little overhead when compared to the original (non-rewritten) ones.

ConQuer provides an interface that enables the user to gradually clean the database. In particular, when a query is submitted, the system shows the clean answers together with a query explanation. The explanation can be extremely valuable, since it often points to underlying errors in the database that require attention from the user.

In my thesis work, I considered integrated databases that may violate a set of primary key constraints. Such databases are present in most organizations. For example, in the domain of Customer Relationship Management (CRM), data sources often contain conflicting information about the same customer. Notably, commercial CRM tools provide limited support for merging tuples corresponding to the same customer into one tuple in the integrated database. Although they typically support some form of conflict resolution rules (e.g., rules that take the average between two conflicting incomes of the same customer), these rules may be difficult to design.
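The query rewriting idea described above can be illustrated with a small sketch. It is not ConQuer's actual rewriting algorithm; it is a hand-written rewriting of one simple selection query, under the certain-answers semantics, for a table whose intended key is violated. The `customer` table, its columns, the sample rows, and the `run` helper are all hypothetical, and SQLite stands in for a commercial DBMS. A customer key is returned by the rewritten query only if every conflicting tuple for that key satisfies the selection condition, so the answer holds no matter which tuple is eventually kept.

```python
import sqlite3

# Hypothetical integrated data: custkey is meant to be a key, but the
# table holds conflicting tuples for customers c1 and c3.
ROWS = [
    ("c1", "Ann", 90000),
    ("c1", "Ann", 40000),
    ("c2", "Bob", 120000),
    ("c3", "Cy", 80000),
    ("c3", "Cy", 95000),
]

# Original query q: customers with income above 50000.
ORIGINAL_QUERY = """
SELECT DISTINCT custkey
FROM customer
WHERE income > 50000
ORDER BY custkey
"""

# Hand-written rewriting of q (an illustration, not ConQuer's output):
# keep a key only if no tuple with the same key fails the condition.
REWRITTEN_QUERY = """
SELECT DISTINCT c.custkey
FROM customer AS c
WHERE c.income > 50000
  AND NOT EXISTS (
        SELECT 1
        FROM customer AS d
        WHERE d.custkey = c.custkey
          AND d.income <= 50000)
ORDER BY c.custkey
"""


def run(query):
    """Load the dirty table into an in-memory database and run the query."""
    conn = sqlite3.connect(":memory:")
    # No PRIMARY KEY is declared, so the key violations can be stored.
    conn.execute(
        "CREATE TABLE customer ("
        "custkey TEXT NOT NULL, name TEXT, income INTEGER NOT NULL)"
    )
    conn.executemany("INSERT INTO customer VALUES (?, ?, ?)", ROWS)
    result = [row[0] for row in conn.execute(query)]
    conn.close()
    return result


if __name__ == "__main__":
    print(run(ORIGINAL_QUERY))   # ['c1', 'c2', 'c3']
    print(run(REWRITTEN_QUERY))  # ['c2', 'c3']
```

In this sketch the original query returns c1 even though one of its conflicting tuples has an income of 40000, while the rewritten query drops it. Both are plain SQL, so the rewritten query runs on the same engine as the original.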




Publication date: 2005